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Abstract 


This  paper  bridges  the  gap  between  variable  selection  methods  (e.g.,  Pearson  coefficients,  KS  test)  and  dimensionality  reduc¬ 
tion  algorithms  (e.g.,  PC  A,  LDA).  Variable  selection  algorithms  encounter  difficulties  dealing  with  highly  correlated  data, 
since  many  features  are  similar  in  quality.  Dimensionality  reduction  algorithms  tend  to  combine  all  variables  and  cannot 
select  a  subset  of  significant  variables. 

Our  approach  combines  both  methodologies  by  applying  variable  selection  followed  by  dimensionality  reduction.  This 
combination  makes  sense  only  when  using  the  same  utility  function  in  both  stages,  which  we  do.  The  resulting  algorithm 
benefits  from  complex  features  as  variable  selection  algorithms  do,  and  at  the  same  time  enjoys  the  benefits  of  dimensionality 
reduction.1 
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1  Introduction 

The  question  of  which  kind  of  measurements  to  use  is 
fundamental  to  the  design  of  any  visual  learning  system, 
whether  it  is  a  supervised  system  or  not.  Since  gray  level 
features  are  sometimes  limited  in  learning  objects,  e.g. 
faces  [18],  often  other,  more  informative  variables  are  used. 
Examples  include  wavelets  [16],  wavelet-like  features  [22] 
and  statistics  of  pixels  around  interest  points  [7,  12]. 

These  highly  informative  measurements  can  be  exploited 
in  the  supervised  case  by  feeding  them  directly  into  a  clas¬ 
sifier,  as  done  in  the  above  citations.  Alternatively,  one 
can  compute  the  mutual  information  (or  other  scores)  be¬ 
tween  the  labels  and  single  variables  [21]  or  subsets  of  the 
variables  [17].  These  methods  and  other  variable  selec¬ 
tion  methods  benefit  from  having  a  set  of  informative  fea¬ 
tures.  However,  dimensionality  reduction  algorithms,  such 
as  PCA  and  FDA,  do  not,  in  general,  benefit  from  using 
informative,  uncorrelated  features.  A  wavelet  transform, 
for  example,  can  be  seen  as  a  rotation  in  the  input  space 
[10],  which  no  distance  based  algorithm  (e.g.,  PCA,  LDA, 
ISOMAP,  LLE)  benefits  from. 

On  the  other  hand,  while  dimensionality  reduction  algo¬ 
rithms  do  well  on  sets  of  correlated  features,  variable  se¬ 
lection  methods  perform  poorly.  They  fail  to  pick  relevant 
variables,  because  the  score  they  assign  to  correlated  fea¬ 
tures  is  too  similar,  and  none  of  the  variables  is  strongly 
preferred  over  another. 

Hence,  variable  selection  and  dimensionality  reduction 
algorithms  have  complementary  advantages  and  disadvan¬ 
tages.  Dimensionality  reduction  algorithms  thrive  on  corre¬ 
lation  between  variables  but  fail  to  select  informative  fea¬ 
tures  from  a  set  of  more  complex  features.  Variable  selec¬ 
tion  algorithms  fail  when  all  the  variables  are  correlated  but 
do  well  with  informative  variables.  The  scheme  we  propose 
is  simple:  first  extract  informative  features  from  the  data. 
Then,  apply  a  variable  selection  algorithm.  Last,  apply  di¬ 
mensionality  reduction  to  extract  the  most  informative  di¬ 
rections.  The  key  point  here  is  to  use  the  same  optimization 
function  at  all  stages.  This  guarantees  that  the  maximization 
of  the  dimensionality  reduction  stage  is  not  undermined  by 
the  feature  selection  stage  but  instead  benefits.  When  com¬ 
bining  a  random  pick  of  a  feature  selection  algorithm  with  a 
dimensionality  reduction  algorithm,  the  resulting  optimiza¬ 
tion  is  not  clear.  We  show  in  the  experiments  that  in  this 
case  the  results  are  considerably  worse. 

The  combined  optimization  function  searches  for  the 
variable  selection  that  maximizes  the  information  content 
of  the  data’s  most  informative  directions.  We  use  a  simple 
algebraic  definition,  which  similar  to  principle  component 
analysis  (PCA),  uses  the  variance  of  a  direction  as  its  in¬ 
formation  content  score.  Instead  of  strict  (binary)  variable 
selection,  which  makes  our  optimization  an  intractable  enu¬ 


meration  problem,  we  suggest  a  variable  weighting  scheme. 

2.  Motivation 

We  find  that  the  role  of  feature  selection  is  not  well  under¬ 
stood,  even  by  some  very  experienced  researchers.  For  ex¬ 
ample,  many  people  believe  that  SVM  will  always  pick  a 
hyperplane  which  uses  only  the  relevant  features  because  “it 
chooses  the  best  hyperplane”.  A  more  sophisticated  claim 
would  say  that  the  generalization  ability  of  the  SVM  de¬ 
pends  on  the  margin,  which  does  not  change  with  the  ad¬ 
dition  of  irrelevant  features.  These  claims  ignore,  however, 
the  dependence  of  the  SVM  performance  on  the  radius  of 
the  data,  which  increases  with  the  number  of  features.  More 
background  on  variable  selection  in  the  supervised  setting 
can  be  found  in  [3,  23]. 

Since  there  are  fewer  algorithms  for  feature  selection  in 
the  unsupervised  setting,  this  case  is  even  less  understood. 
In  Fig.  1,  we  demonstrate  the  importance  of  feature  selec¬ 
tion  when  combined  with  PCA.  The  data  was  generated  by 
two  3D  Gaussians  with  random  parameters.  Additional  200 
dimensions  were  generated  in  a  similar  manner.  These  di¬ 
mensions  were  each  permuted  separately,  to  remove  any 
class  membership  information  in  those  dimensions2. 

As  can  be  seen  from  Fig.  1,  PCA  does  not  perform  well 
in  the  presence  of  the  irrelevant  features,  i.e  it  fails  to  sepa¬ 
rate  the  clusters.  An  application  of  the  simple  Algo.  2  pre¬ 
sented  below  solves  the  problem,  and  the  good  separation 
between  the  clusters  is  clear.  This  was  done  without  using 
any  label  information. 

The  loss  of  rotation  invariance.  The  algorithms  we  use 
in  this  work  weight  each  feature  separately.  They  are  not 
invariant  to  the  change  of  coordinates.  This  might  seem 
troubling  at  first,  as  PCA  and  LDA  are  both  kernel  methods 
(methods  that  depend  only  on  the  similarity  matrix  ( kernel ) 
between  every  two  data  points)  and  enjoy  invariance  to  uni¬ 
tary  transformations  of  the  data.  However,  in  the  context  of 
feature  selection,  it  can  be  shown  that  rotation  invariance  is 
not  a  desirable  property. 

For  the  supervised  case,  Ng  [15]  has  shown  that  for  any 
rotationally  invariant  classification  algorithm,  there  exist  at 
least  one  classification  problem  that  is  solvable  by  simply 
thresholding  one  variable,  but  which  cannot  be  learned  by 
this  algorithm  with  a  reasonably  sized  dataset:  The  required 
size  will  be  linear  in  the  total  number  of  features,  while 
some  feature  selection  algorithms  will  need  only  the  loga¬ 
rithm  of  it.  In  the  unsupervised  case,  the  situation  is  similar. 
If  an  algorithm  is  rotationally  invariant,  then  all  of  the  infor¬ 
mation  except  the  data’s  dimension  is  contained  in  the  dis¬ 
tances  between  every  two  points  (i.e,  in  the  kernel  matrix). 

2Matlab:  M  =  [rmvrnd(rand(203, 1)  —  0.5,  diag(rand(203, 1)  + 
.5,  50))',  mvnrnd(rand(203, 1)  —  0.5,  diag(rand(203, 1)  + 

.5,50 ))'];fori  =  4  :  203,  M(z, :)  =  M(i,randperm(  100)); 
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Figure  1 :  This  figure  demonstrates  the  importance  of  feature  selection  in  the  unsupervised  setting,  and  that  even  though  our  algorithms 
and  PC  A  use  similar  optimization  functions,  they  are  very  different  in  their  capabilities  of  dealing  with  irrelevant  variables:  (a)  Three 
relevant  dimensions  out  of  203.  The  rest  of  the  dimensions  are  similar  but  were  permuted  to  remove  any  class  membership  information; 
(b)  The  first  2  principle  components  of  only  the  3  relevant  dimensions;  (c)  PC  A  of  the  whole  203  dimensions;  (d)  The  results  of  applying 
PCA  after  weighting  according  to  the  a  weights  recovered  by  Algo.  1. 


However,  the  kernel  matrix  is  corrupted  by  the  irrelevant 
features.  In  fact,  for  a  linear  kernel,  where  the  similarity 
between  every  two  points  is  just  the  dot  product,  the  ker¬ 
nel  matrix  is  the  sum  of  the  kernel  matrix  of  the  relevant 
features,  and  of  that  of  the  irrelevant  features.  Decompos¬ 
ing  the  kernel  back  to  the  two  matrices  is  almost  impossible 
when  dealing  with  a  large  number  of  irrelevant  features. 

Using  random  matrix  theory,  we  can  try  estimate  the 
severity  of  the  situation.  Let  our  q  data  points  be  repre¬ 
sented  as  the  columns  of  the  matrix  M  =  [M1T,  M2T]T, 
where  Ml  is  an  n\  x  q  matrix  containing  the  n\  relevant 
variables  of  our  data,  and  M2  is  an  n2  x  q  matrix  containing 
the  irrelevant  data.  Assume  further  that  every  feature  (row 
of  the  matrix  M)  is  normalized  to  have  a  variance  of  1.  The 
kernel  matrix  A  =  MTM,  is  the  sum  of  A1  =  Mj  M\  and 
A2  =  Mj  M2  (throughout  this  work  A  will  denote  such  a 
kernel  matrix  of  the  form  A  =  MTM,  and  B  will  denote 
its  dual  -  the  covariance  matrix  B  =  MMT.  In  many  cases 
we  will  use  the  fact  that  they  share  the  same  eigenvalues  to 
switch  between  the  two). 

In  what  is  the  ideal  situation  for  most  spectral  clustering 
algorithms,  the  relevant  data  is  composed  out  of  two  equally 
sized  very  tight  clusters.  These  clusters  would  be  centered 
around  two  points  positioned  as  far  as  possible  from  each 
other  in  Jini ,  e.g,  one  cluster  might  contain  q/2  points  equal 
to  [1, 1, 1  ]T /sqrt(q),  and  the  other  cluster  might  con¬ 
tain  q/2  copies  of  the  point  [— 1,  —  1, —  1  ]T /sqrt(q).  The 
kernel  matrix  of  the  relevant  data  A\  will  be  a  rank  1  matrix 
with  an  eigenvalue  n\. 

Assume  now  that  in  our  ideal  situation,  the  irrelevant 
variables  (the  elements  of  the  matrix  M2)  are  i.i.d  zero  mean 
Gaussians  with  variance  1/q  (so  the  variance  of  each  row 
has  an  expectation  of  1).  The  matrix  A2  here  is  called  a 
Wishart  matrix ,  and  its  first  eigenvalue  distributes  with  a 
mean  of  n2/q2(y/n^  +  y^)2  [8]. 


Let  ri  be  the  only  eigenvector  of  A\  with  a  positive 
eigenvalue.  Rotating  M2  by  multiplying  it  from  the  right 
with  a  q  x  q  rotation  matrix  R ,  would  not  change  its  dis¬ 
tribution  (follows  from  the  basic  properties  of  Gaussians). 
Hence  vj A2v  1  has  the  same  distribution  as  ej A2e  1,  e\  be¬ 
ing  a  vector  of  1  followed  by  q  —  1  zeros.  Thus,  vj A2v\ 
has  the  same  distribution  as  the  top  left  element  of  A2.  This 
element  is  just  the  square  norm  of  a  vector  of  n2  random 
Gaussians  with  mean  zero  and  variance  1/q ,  and  distributes 
according  to  a  Chi- squared  distribution  around  a  mean  of 
n2/q2. 

Since  vj Av  1  =  vj  A\V\  +vj  A2v  1,  it  distributes  around 
a  mean  of  n\  +  n2/q2.  Similarly,  let  u\  be  the  first  eigen¬ 
vector  of  A2.  Then  uj Au\  >  u[A2u\,  which  concen¬ 
trates  around  n2/c^^y/n  +  y^)2.  When  rt\  +  n2/q2  « 
n2 / q2  [y/aA  +  y^)2,  vi  is  not  distinguishable  from  u\  and 
any  unsupervised  kernel  method  would  fail. 

The  situation  is  worse  in  less  ideal  cases,  where  the  real 
data  is  noisy,  and  the  irrelevant  data  has  a  non-random  struc¬ 
ture.  When  the  number  of  irrelevant  features  grows,  rota- 
tionally  invariant  algorithms  will  fail  since  they  only  see 
the  matrix  A.  Non  rotationally  invariant  algorithms  can 
identify  a  group  of  features  that  lead  to  a  decomposition 
A  =  Ai  +  A2  such  that  A\  is  not  likely  to  be  a  random 
matrix  and  A2  is  likely  to  be  a  random  matrix. 

3.  Weighting  Variables  for  Unsupervised  Di¬ 
mensionality  Reduction 

Given  a  dataset  where  each  example  is  a  vector  in  5Rn,  our 
goal  is  to  weigh  the  variables  such  that  the  most  informative 
directions  in  the  reweighed  data  contain  as  much  informa¬ 
tion  as  possible.  Similarly  to  the  PCA  approach,  we  use  the 
variance  along  this  direction  as  a  score  for  its  information 
content.  We  combine  these  scores  for  the  dataset’s  k  most 
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informative  directions  to  obtain  the  information  content  of 
the  whole  dataset. 

Let  M  be  our  data  matrix  with  n  rows,  each  represent¬ 
ing  a  variable,  and  q  data  samples  as  columns.  We  then 
weigh  each  variable  i  by  a  positive  weight  y/ai  3.  Let 
D  =  diag(y/a)  be  the  diagonal  weighting  matrix  and  let 
M  =  DM  be  our  reweighed  data  matrix. 

A  projection  of  the  weighted  data  onto  a  direction  vec¬ 
tor  w  (a  vector  on  the  n-dim  sphere)  produces  a  new  vari¬ 
able  v  =  wT M  with  variance  Yli= iCK^)  —  fi)2-  Since  we 
are  only  concerned  with  the  variances,  we  can  subtract  the 
average  of  each  variable  from  the  data  matrix  and  assume 
that  the  mean  of  each  row  in  M  is  0.  Therefore,  the  means 
of  each  row  in  M  are  also  zero,  as  well  as  the  mean  of  v. 
Hence,  the  variance  of  the  new  variable  v  is  given  by  the 
simple  dot  product  vTv. 

We  wish  to  find  a  norm  1  weight  vector  a  and  k  or¬ 
thonormal  directions  W  =  [wi\w2\ws\...\wk]  which  maxi¬ 
mize  the  score:  1 1  ( var  (yi ) ,  var  (v2 ) ,  •  •  • ,  var  (vk ) )  1 1 2 ,  where 
Vi  =  wjM ,  i  =  l..k  are  the  projections  of  the  weighted 
data  onto  those  directions. 

This  is  equivalent  to  maximizing 
J2i=i(var(wf  DM))2  =  (w?DMMTDWi)2  = 

=  T,ki=1w'[(DMMTD)(DMMTD)wi.  Setting  the 
matrix  Ba  =  DMMTD ,  this  amounts  to  maximizing 
trac  e(WTBaBaW)  =  \\WTBa\\2F. 

This  trace  is  maximized  for  any  fixed  a  when  W  is  just 
the  first  k  columns  of  U,  and  U diag(  [or ,  a 2 , . . . ,  crn\ )  V  is  the 
S  VD  of  the  matrix  Ba  4  [9] .  The  values  of  the  maximization 
function  would  in  that  case  be:  J2i= 1  ai- 

The  singular  values  of  the  positive  definite  matrix  Ba  = 
MMt  are  just  its  eigenvalues.  These  eigenvalues  are  the 
squares  of  the  singular  values  of  the  matrix  M,  which  are 
also  the  square  roots  of  the  eigenvalues  of  Aa  =  MT  M  = 
MTdiag(a)M. 

This  relation  between  the  right  and  left  singular  values 
of  the  matrix  M  is  similar  to  the  use  of  the  kernel  trick  for 
many  algorithms.  The  pairs  of  matrices  Aa/Ba  and  Q/W 
are  pairs  of  duals  for  this  optimization  problem.  By  this  du¬ 
ality  we  can  maximize  the  expressions  involving  Ba  with¬ 
out  having  to  deal  with  the  square  roots  in  the  matrix  D. 

From  the  above  discussion,  the  maximization  of 
trace (WT BaBaW)  subject  to  W  being  orthonormal  and 
a  having  a  norm  of  1,  is  equivalent  to  the  maximization  of 
trace (QT AaAaQ)  for  an  orthonormal  Q  and  the  same  a. 
An  algorithm  for  this  optimization  problem  was  suggested 
in  [24],  where  the  problem  of  finding  a  set  of  features  which 
is  optimal  for  clustering  into  k  clusters  was  considered.  This 
algorithm  is  termed  the  Q  —  a  algorithm,  and  is  embedded 

3  The  representation  of  the  vector  a  as  the  vector  of  the  element  wise 
squares  of  the  actual  weights  (-y/a)  is  done  in  order  to  make  the  resulting 
optimization  problem  bilinear. 

4W  can  be  any  other  orthonormal  basis  of  this  k- dim  subspace  as  well. 


in  step  two  of  the  algorithm  below. 

Algorithm  1  (Weighting  for  PCA)  Let  N  be  an  n  x  q  in¬ 
put  matrix.  Perform  the  following  steps: 

1.  Center  the  data  to  have  a  mean  of  0  and  normalize  it  to  have 
a  norm  of  1  in  each  variable.  Let  M  be  the  normalized  data 
matrix,  with  rows  m^, ...,  mj. 

2.  Let  Q be  some  orthonormal  q  x  k  matrix.  Perform  the 
following  steps  with  iterations  index  r  —  1,2,...: 

(a)  Let  G ^  be  a  matrix  whose  (i,  j)  components  are 
KT„iJ)mtTQ(r-1)Q(r-1)TmJ. 

(b)  Let  be  the  leading  eigenvector  of  G(r> . 

(c)  LetA(r)  =  y"*=1  af)mimlT. 

(d)  Let  Q(r>  be  the  first  k  eigenvectors  of  Aa. 

(e)  Increment  index  r  and  go  to  step  a. 

3.  Reweight  the  data  M  —  diag(^/a)M. 

4.  Compute  the  Principle  Components  W  of  the  matrix  M. 

From  the  discussion  in  [24],  the  resulting  weight  vec¬ 
tor  a  would  be  sparse  and  composed  out  of  positive  ele¬ 
ments.  Note  that  although  our  goal  is  somewhat  different 
than  the  one  suggested  there  (increasing  information  con¬ 
tent,  and  not  supporting  k  good  clusters),  we  get  a  similar 
algorithm.  The  main  difference  is  the  centering  of  the  data, 
which  seems  to  have  a  large  significance  in  practice. 

3.1.  A  parameter  free  algorithm 

Algo.  1  above  is  an  iterative  algorithm  which  converges  to 
a  local  maxima.  This  iterative  nature  allows  it  to  search 
for  a  subset  of  features  among  a  large  amount  of  irrelevant 
data.  However,  sometimes  we  might  prefer  a  non-iterative 
algorithm  which  is  guaranteed  to  return  the  same  solution  in 
every  run.  Also,  in  many  situations  we  are  willing  to  sacri¬ 
fice  some  accuracy  of  the  returned  weights  for  the  ability  to 
work  faster  with  many  more  variables.  The  algorithm  pre¬ 
sented  next  is  an  approximation  of  Algo.  1  which  archives 
both  goals,  and  is  also  parameter  free. 

The  basic  idea  behind  the  approximation  algorithm  is 
simple.  When  we  optimize  for  second  order  expressions 
like  Yli=i(var(wI DM))2,  those  components  with  the 
largest  values  are  going  to  dominate  the  expression.  For 
that  reason,  the  result  is  not  going  to  change  much  whether 
we  are  optimizing  the  first  k  most  informative  directions, 
or  any  l  >  k  most  informative  directions.  The  score  we 
maximize  is  probably  going  to  remain  high  for  successful 
weighting  and  low  for  unsuccessful  ones.  Moreover,  most 
of  the  time  we  do  not  have  any  information  about  the  cor¬ 
rect  parameter  k  in  advance,  and  we  might  want  to  increase 
the  information  content  in  all  possible  directions. 

Based  on  these  intuitions  we  explore  the  case  where 
k  =  min(q,ri).  Since  the  trace  of  a  matrix  is  the  sum 
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of  its  eigenvalues,  the  optimization  problem  of  maximizing 
trace (QT AaAaQ)  becomes  maximizing  trace(AaAa)  = 
trace(MTdiag(a)MMTdiag(a')M).  This  expression  is  bi¬ 
linear  in  a  and  can  be  written  as  maximizing  aT Ha,  where 
Hij  =  trac  rrijirij)  =  {mjrrij )  trace  {m^mj)  = 

(mjrrij)2. 

Algorithm  2  (Parameter  Free  Weighting  for  PC  A)  Let 

M  be  an  n  x  q  normalized  and  centered  input  data  matrix. 
Perform  the  following  steps 

1.  Let  H  —  (MMT).2,  where  the  ?  means  taking  the  square 
of  each  element  separately. 

2.  Let  a  be  the  leading  eigenvector  of  H. 

3.  Re -weight  the  data  and  apply  PCA  on  M  —  diag{^/a)M 

We  implement  both  steps  of  the  algorithm  in  a  parallel 
scheme.  In  each  stage,  only  a  part  of  the  matrix  H  is  loaded 
into  memory.  This  enables  us  to  handle  up  to  105  features 
and  hundreds  of  data  points,  while  still  maintaining  a  rea¬ 
sonable  running  time. 

The  results  of  Algo.  2  are  less  sharp  than  those  of  Algo. 
1 .  The  ratio  of  the  average  weights  between  the  relevant  and 
irrelevant  features  is  lower  with  parameter  free  algorithm. 
An  example  is  shown  in  Fig.  2.  In  this  experiment,  the  first 
3  variables  are  chosen  from  4  multivariate  normal  distribu¬ 
tions  with  diagonal  covariance  matrices.  The  remainder  of 
the  200  variables  are  selected  from  the  same  distributions 
but  are  each  permuted  independently.  In  this  way,  the  re¬ 
maining  200  features  give  no  information  about  the  under¬ 
lying  cluster  from  which  the  data  points  stem.  As  it  follows, 
the  assignment  of  weights  by  taking  the  first  PCA  of  the 
data  is  uninformative  for  feature  selection  purposes.  Algo. 
1  assigns  very  high  weights  to  the  relevant  variables,  and 
much  less  to  the  rest.  Algo.  2  successfully  detects  the  rele¬ 
vant  variables,  but  assigns  somewhat  higher  weights  to  the 
irrelevant  variables. 

4.  Supervised  variable  selection  for  FDA 

We  next  describe  how  to  handle  the  supervised  case.  In  [24] 
a  different  solution  is  suggested  using  a  kernel  as  close  as 
possible  to  the  ideal  block  diagonal  kernel.  Here  we  take 
a  different  approach,  viewing  the  matrix  Aa  as  the  ker¬ 
nel  for  a  Kernel  Fisher’s  Discriminant  (KFD)  analysis  [14]. 
The  kernel  case  of  Fisher  Discriminant  Analysis  (FDA)  was 
treated  in  the  past  for  two  class  classification  and  for  regres¬ 
sion.  We  extend  it  to  the  multi-class  case,  then  match  it  with 
the  appropriate  variable  selection.  Due  to  space  constraints, 
the  derivations  are  moved  to  the  appendix. 

Algorithm  3  (Supervised  Variable  Selection)  Let  M  be 

an  n  x  q  input  data  matrix  containing  samples  from  l 
classes.  Let  MW  be  an  nx  q $  matrix  containing  only  those 
samples  taken  from  class  i.  Perform  the  following  steps 


Figure  2:  Comparison  of  on  a  synthetic  dataset  where  the  first  3 
variable  are  relevant.  Only  the  first  50  variables  are  shown.  Algo.  1 
produces  sharper  results  than  Algo.  2.  The  first  PCA  does  not 
detect  the  relevant  variables. 

1.  Perform  an  eigen-decomposition  of  the  matrix  A  = 
MtM  =  EDEt 

2.  Compute  the  means  p  —  ^ Mlq ,  p^  — 

3.  Let  H  =  Y!i=1  qi((MED~1ETMT).  *  (ppT  -  //« P  - 
pp^T  +  tip)),  where  the  .*  means  multiplying  each  ele- 
ment  separately 

4.  Let  G  =  J2li=1qi((MED~1ETMT).  *  (M(i)M(i)T  - 

5.  Let  a  be  the  leading  eigenvector  of  H  —  A G 

5.  Analysis 

The  choice  of  the  optimization  criteria.  The  optimization 
criteria  used  for  the  weighting  is  based  on  a  simple  algebraic 
definition  of  an  informative  score  of  a  data  matrix.  It  gives 
rise  to  a  simple  quadratic  form  which  is  easy  to  solve.  De¬ 
scribed  using  the  eigenvalues  cr1? cr2, ...  of  the  matrix  Ba , 
the  optimization  criteria  simply  becomes  Yli=i  ai-  This 
utility  function  is  closely  related  to  the  square  loss  function 
in  the  supervised  case.  In  general,  we  can  try  to  optimize 
similar  utility  functions  Yli=i  )>  where  /  is  any  (mono¬ 
tonic)  function. 

For  simplicity,  we  are  going  to  refer  in  the  discussion 
below  only  to  the  case  where  k  =  min(q ,  n)  (the  case  of 
Algo.  2).  If  F  is  the  matrix  version  of  the  function  /  then  it 
is  easy  to  show  that  trace(F(T>))  =  JA  /(cr^).  The  result¬ 
ing  optimization  problem  might  be  hard  to  solve  directly. 
However,  it  can  be  approximated  using  the  Taylor  series  of 
the  function  /  up  to  any  order.  If  there  are  not  too  many 
variables,  one  can  use  quadratic  programming  in  order  to 
solve  these  approximations. 
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Why  not  use  f(x)  m  x  ?  There  is  a  special  case  where 
the  function  /  is  the  identity  function.  This  choice  of  / 
would  result  in  a  weighting  scheme  that  simply  tries  to  max¬ 
imize  the  trace  of  the  kernel  matrix  Aa .  The  problem  would 
transform  into  max  aTu  where  each  element  of  the  vector 
u  is  just  Ui  =  trace =  mjrrii.  The  maximum  is 
obtained  when  a  =  —?==  • 


However,  kernels  with  a  large  trace  (such  as  those  re¬ 
turned  by  the  previous  optimization  criteria)  are  bad  for 
generalization  [4]  (this  is  not  obvious  without  reading  the 
reference.  As  an  intuition:  for  kernels  with  large  elements 
along  the  diagonal,  each  point  is  separated  from  all  other 
points).  Since  generalization  is  crucial  for  unsupervised 
feature  selection,  just  as  it  is  in  the  supervised  case,  we 
would  like  to  achieve  good  clustering,  but  we  would  not 
like  to  use  a  kernel  Aa  that  supports  any  random  clustering. 

Note  that  in  both  Algo.  1  and  Algo.  2,  the  trace  of  the 
kernel  (Aa)  is  controlled  by  the  constraint  on  a.  The  trace 
of  the  kernel  is  the  same  as  the  trace  of  the  covariance  ma¬ 
trix  Ba.  The  covariance  matrix  is  a  sum  of  rank-1  matrices, 
each  with  a  trace  of  1  (all  features  are  normalized  to  have 
a  norm  of  1)  but  weighted  by  a  corresponding  weight  a2. 
Since  the  trace  operator  is  linear,  the  trace  of  the  covariance 
matrix  is  the  same  as  the  norm  of  the  vector  a ,  which  is 
constrained  to  be  1. 


Why  use  f(x)  =  x2  ?  Recall  from  Sec.  2  that  the 
main  clue  to  help  us  identify  whether  a  kernel  matrix 
A  is  good  is  that  A  should  not  seem  random.  Hence, 
we  can  turn  to  random  matrix  theory  to  help  us  define 
a  good  kernel.  The  Gaussian  Ensemble  (GE)  is  proba¬ 
bly  the  most  basic  symmetric  random  matrix.  It  is  de¬ 
fined  as  a  family  of  real  symmetric  n  x  n  matrices  A  such 
that  the  upper  triangular  elements  are  independent  Gaus- 
sians  with  zero-mean  and  variance  1  along  the  diagonal, 
and  variance  1/2  elsewhere  (Matlab:  A  =  randn(n)  ; 
A  =  (A+A'  )  /2; ).  Given  a  matrix  A  we  can  verify  its 
probability  to  be  in  this  family  simply  by  computing  the 
product  of  all  the  (n2  +  n)/ 2  independent  Gaussians,  i.e., 
P(A  €  GE)  ~  U^e-^TUe-^/2  =  n^e"^  = 
ell^llp/2.  Hence,  the  probability  of  a  matrix  A  being  ran- 
dom  is  proportional  to  the  exponent  of  the  square  of  its 
Frobenius  norm.  Recall  that  ||A|||,  =  trace(2L4T),  which 
is  the  sum  of  the  eigenvalues  of  AAr,  which  for  a  sym¬ 
metric  matrix,  is  the  sum  of  the  squares  of  its  eigenvalues. 
Hence,  a  matrix  with  a  high  sum  of  squared  eigenvalues  is 
not  likely  to  be  a  random  GE  matrix.  This  gives  a  direct 
motivation  to  the  use  of  this  score.  In  order  to  handle  the 
k  <  n  case  of  Algo.  1,  we  add  the  assumption,  common  to 
spectral  techniques,  that  the  information  is  contained  only 
in  the  matrix  elements  which  are  spanned  by  the  first  few 
eigenvectors  of  it.  Hence,  if  Q  is  the  n  x  k  matrix  holding 
the  eigenvectors  of  A,  and  recall  that  QQT  is  the  projection 
matrix  to  the  linear  subspace  spanned  by  the  columns  of  Q , 


then  if  QQT  A  has  a  high  score  it  is  unlikely  to  be  random. 
In  this  case,  the  score  is  only  approximated,  since  the  el¬ 
ements  of  QQT  A  are  not  independent,  even  for  a  random 
matrix. 


6.  Experiments 

Face  Recognition.  We  first  applied  our  methods  to  per¬ 
form  face  recognition.  We  used  two  publicly  available 
datasets:  the  YALE  dataset  [25]  and  the  AR  dataset  [13]. 
The  YALE  dataset  contains  15  different  persons,  each  one 
photographed  1 1  times  under  different  illumination,  under 
different  expression  and  with  or  without  glasses.  For  the 
AR  dataset  we  used  only  the  50  males,  each  having  14  im¬ 
ages.  For  both  datasets  we  created  20  instances  of  similar 
experiments,  where  3  images  per  person  were  picked  ran¬ 
domly  to  be  the  training  set  for  that  person.  The  rest  served 
as  test  set. 

We  compared  several  algorithms:  The  eigenface  method 
[20]  which  uses  PCA  to  reduce  the  dimension  of  the  face 
images;  the  fisherface  method  [2]  which  applies  multi-class 
Fisher  discriminant  analysis  to  face  images  after  they  have 
been  reduced  in  dimension  using  PCA;  Both  methods  after 
applying  conventional  feature  selection  algorithms;  an  ap¬ 
plication  of  eigenfaces  after  variable  weighting  using  Algo. 
2;  and  application  of  fisherfaces  after  variable  weighting 
using  Algo.  2.  All  methods  were  tested  twice:  for  gray 
level  images  and  for  wavelets.  As  expected,  Algo.  2  and 
all  other  3  feature  selection  algorithms  did  not  work  well 
when  dealing  with  gray  level  values  directly,  and  those  re¬ 
sults  are  omitted  from  the  table.  When  applying  eigenfaces 
and  fisherfaces  directly  to  the  data  without  any  feature  se¬ 
lection,  gray  level  performed  better  than  haar  wavelets. 

We  also  compared  the  results  with  similar  variants  using 
Algo.  1  and  3,  which  for  these  datasets  did  not  perform  bet¬ 
ter  than  Algo.  2.  This  was  probably  due  to  the  large  number 
of  classes  (person  identities)  used  in  each  experiment.  All 
of  these  are  omitted  from  the  table  below. 

The  values  in  each  cell  in  Table  1  are  the  average  recog¬ 
nition  rate  (in  percent)  over  20  runs,  and  the  standard  devi¬ 
ation  of  the  results.  Results  are  shown  for  all  methods  for 
the  PCA  dimension  which  gave  the  best  results.  For  fea¬ 
ture  selection  (FS)  we  show  the  best  out  of  three  supervised 
methods  (Pearson  coefficients,  Fisher  criterion  score,  and 
the  Kolmogorov- Smirnov  test)  and  for  the  best  percentile  of 
kept  features.  It  turns  out  that  for  these  datasets  the  Fisher 
criterion  score  did  the  best  out  of  the  three,  and  this  perfor¬ 
mance  occurred  when  keeping  40-50%  of  the  features. 

As  shown  in  Table  1,  the  use  of  wavelets  together  with 
Algo.  2  improves  the  results  both  for  eigenfaces  and  for 
fisherfaces.  For  the  YALE  dataset,  running  Fisher  using 
Algo.  2  with  weighting  of  wavelet  features  performed  better 
than  the  Fisher  results  on  gray  levels  for  75%  of  the  trials, 
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DataSet 

PCA 

PCA  +  Haar 

FS  +  PCA 

Algo.  2  +PCA 

Fisher 

Fisher  +  Haar 

FS  +  Fisher 

Algo.  2  +Fisher 

Yale 

AR 

70.0  =b  2.5 
62.9  zb  2.4 

64.5  zb  2.2 
32.2  zb  1.6 

67.6  zb  4.9 
62.9  =b  4.4 

81.2  =b  2.4 
70.6  zb  1.8 

83.4  =b  3.2 
91.9  zb  1.3 

81.8  zb  3.1 
90.4  ±  1.8 

83.4  zb  2.7 
90.9  zb  1.6 

88.2  zb  2.4 

94.6  ±  0.8 

Table  1 :  Face  recognition  precision  on  two  databases,  Yale  and  AR.  Results  indicate  percentage  of  test  examples  correctly  identified  plus 
or  minus  the  standard  deviation  across  20  independent  trials.  In  each  trial  3  images  per  person  served  as  training  images,  and  the  rest  as 
testing.  Results  are  recorded  for  seven  algorithms:  PC  A  is  an  application  of  eigenfaces;  PC  A  +  Haar  is  an  application  of  PC  A  on  Haar 
wavelet;  FS  +  PCA  is  the  best  filtering  method  out  of  Pearson  coefficient,  Fisher  score,  and  KS  test  applied  to  wavelet  coefficients  followed 
by  eignefaces;  Algo.  2  +  PCA  is  an  application  of  eigenfaces  after  re-weighting  the  haar  wavelet  data  according  to  Algo.  2.  The  rest  of  the 
columns  report  results  for  similar  experiments  but  with  Fisherfaces  instead  of  eigenfaces. 


Figure  3:  Top  two  rows  show  the  first  10  principle  components 
of  the  wavelets  weighted  by  Algo.  2  on  the  YALE  dataset.  Bottom 
rows  show  the  first  10  wavelet  features  picked  by  Algo.  2  on  the 
same  experiment. 


and  performed  the  same  for  15%.  For  the  AR  dataset,  the 
method  using  wavelets+Algo.  2+Fisher  gave  the  best  per¬ 
formance  for  all  trials. 

Figure  3  shows  the  first  principle  components  of  the 
wavelets  for  a  specific  YALE  dataset  experiment.  The 
wavelet  features  have  been  weighted  according  to  Algo.  2 
and  then  the  principle  components  were  computed.  Also 
shown  are  the  wavelets  chosen  by  this  algorithm.  The  algo¬ 
rithm  tends  to  choose  large/medium  sized  wavelets. 

Car  Detection.  We  applied  Algo.  2  to  a  car  detection  in 
static  images  task.  We  used  the  car  images  database  from 
UIUC  [1].  The  goal  is  to  locate  the  cars  accurately  in  a 
number  of  large  grayscale  images.  The  test  images  vary 
in  size,  and  may  contain  multiple  cars.  The  training  data 
consists  of  500  car  images  and  550  non-car  images. 

We  compared  four  sets  of  features:  the  gray  level  val¬ 
ues,  Haar  wavelets,  Haar  wavelets  weighted  by  Algo.  2, 
and  Haar  wavelets  weighted  by  Algo.  3.  We  transformed 
each  training  image  to  the  feature  space,  and  learned  a  tem¬ 
plate  using  a  linear  Support  Vector  Machine  5.  Using  the 
fact  that  the  transformation  to  the  wavelet  feature  space  was 
linear,  as  well  as  the  variable  weighting,  we  were  able  to 

5 For  Algo.  2  we  also  tried  LDA,  and  received  almost  identical  results 


Figure  4:  Template  for  car  detection  learned  via  SVM  from 
four  different  feature  sets,  (a)  histogram  equalized  gray  levels  (b) 
Haar  wavelets  (c)  Algo.  2  on  Haar  wavelets  (d)  Algo.  3  on  Haar 
wavelets. 


transform  the  learned  template  in  features  space  back  into 
an  image,  as  shown  in  Fig.  4.  We  searched  for  this  tem¬ 
plate  using  normalized  cross  correlation  in  the  test  images, 
and  then  used  the  same  neighborhood  suppression  as  de¬ 
scribed  in  [1],  yielding  a  set  of  car  detections  weighted  by 
strength  of  correlation.  When  compared  against  the  ground 
truth  locations  included  alongside  the  test  data,  one  is  able 
to  construct  a  Precision-Recall  (PR)  curve. 

The  PR  curves  we  calculated  are  presented  in  Fig.  5.  As 
shown  in  the  graph,  SVM  on  gray  levels  performs  at  about 
the  same  level  as  the  component  based  detector  presented 
in  [1].  SVM  on  wavelet  features  does  worse  and  SVM 
on  Algo.  2  weighted  wavelet  features  does  better.  SVM 
on  Algo.  3  weighted  wavelet  features  does  better  than  all 
other  methods.  Weights  returned  by  Algo.  3  were  extremely 
sparse,  with  over  90%  of  the  total  magnitude  of  the  weights 
assigned  to  only  25  of  the  wavelets.  Thus  Algo.  3  combined 
with  SVM  enables  the  control  of  the  run-time  of  the  learned 
classifier  much  like  Boosting  does. 

Place  recognition.  For  this  experiment  we  used  the  data 
collected  by  Torralba  et.  al.[19],  in  order  to  compare  su¬ 
pervised  learning  algorithms.  Given  an  image,  the  goal  is 
to  classify  it  into  one  of  ten  categories:  conference-room, 
corridor,  elevator  lobby,  inside  elevator,  kitchen,  lab,  office, 
open  area,  plaza  and  street.  The  data  consists  of  50,  757 
frames  collected  over  17  sequences  using  a  wearable  cam¬ 
era.  While  this  method  achieves  good  identification  results 
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Figure  5:  Precision-Recall  curve  for  the  UIUC  car  dataset.  SVM 
templates  are  learned  using  gray  levels,  Haar  wavelets,  Algo.  2 
weighted  wavelets  and  Algo.  3  weighted  wavelets. 

by  using  temporal  information,  the  task  of  identifying  a 
single  frame  is  quite  difficult.  For  example,  it  is  not  easy 
to  distinguish  a  lab  from  an  office  or  a  conference  room. 
Each  frame  in  the  dataset  was  represented  by  a  vector  of 
384  dimensions,  consisting  of  the  output  of  steerable  fil¬ 
ters  applied  to  the  input  120  x  160  image  at  several  scales. 
This  representation,  together  with  the  frame  annotation,  was 
made  publicly  available  by  the  authors  of  [19]. 

We  conducted  repeated  one-vs-all  experiments.  In  each 
experiment  100  random  examples  of  one  place  catagory 
served  as  the  set  of  positive  training  examples,  and  100  ran¬ 
dom  examples  from  the  rest  of  the  frames  served  as  negative 
examples.  The  results  were  then  tested  on  a  much  larger 
set  of  testing  examples,  which  contained  equal  numbers  of 
positive  and  negative  examples.  Each  such  one-vs-all  ex¬ 
periment  was  repeated  three  times. 

In  [19]  the  authors  used  the  gentle  boost  algorithm  on 
top  of  PC  A.  In  our  experiments,  SVM  always  outperfoms 
the  boosting  algorithm.  Moreover,  in  all  of  our  experi¬ 
ments,  PC  A  did  not  help  at  all  (not  even  for  boosting), 
hence  we  used  the  original  384  dimensional  data.  The  re¬ 
sults  we  obtained  are  26.67%  error  for  an  80  dimensional 
PCA  followed  by  a  linear  SVM,  25.10%  for  linear  SVM, 
and  23.17%  for  Algo.  2  weighting  of  the  features  followed 
by  an  SVM.  The  results  improve  to  22.83%  error  when  us¬ 
ing  Algo.  3  weighting  and  then  SVM. 

7.  Conclusions 

We  presented  variable  weighting  algorithms  that  were  de¬ 
signed  to  preprocess  the  data  before  applying  dimension¬ 
ality  reduction  algorithms.  In  their  parameter-free  form 
(Algo.  2,  and  Algo.  3),  they  are  very  simple  to  implement, 


and  can  be  applied  to  tens  of  thousands  of  features.  We 
show  that  using  these  algorithms  we  can  effectively  apply 
dimensionality  reduction  algorithms  to  datasets  of  uncorre¬ 
lated  variables,  benefiting  from  both  worlds  of  dimension¬ 
ality  reduction  and  feature  selection. 
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A.  Derivation  of  multi-class  kernel  FDA,  and  Alg.  3 


Consider  q  data  points  Mi ,  M2 , . . . ,  Mq  with  mean  p  which 
are  assigned  to  l  classes.  Each  class  Ci  contains  qi  exam¬ 
ples  with  mean  pi.  The  FDA  ([6])  examines  orthogonal 
directions  which  maximize  the  inter-class  variance,  while 
minimizing  each  inner-class  variance.  This  is  done  by  max- 
imizing  the  ratio  §r§^j,  where  SB  =  - 

p)(pi  —  p)T  is  the  between  class  scatter  matrix,  and  Sw  = 
Y^i= i  —  /Xi)(iV  —  Hi)T  is  the  within  class  scat¬ 

ter  matrix.  This  maximization  problem  is  solved  by  con¬ 
sidering  the  largest  generalized  eigenvectors  of  the  pair 
(Sb,  Sw)-  It  is  not  difficult  to  show  that  each  of  the  di¬ 
rections  w  returned  by  the  FDA  is  a  linear  combination 
w  = 

Let  i  =  1../  be  the  matrix  composed  only  of  mea¬ 
surements  arising  from  class  Ci.  Let  A  =  MT M,  = 
MT M W  ( q  x  qi  matrix),  and  let  lp  be  a  vector  of  p  ones. 
Substituting  the  expression  for  w  as  a  linear  combination 


w  =  My  in  the  ratio  to  be  maximized,  we  get:  WT^BW  - 

1  ’  &  W 1  SwW 

where  R  =  Eli=1  1*  -  \Alq){^A^lqi 


Ml,)rand5  =  EU^(0(^  “  ■ 


This  derivation  is  similar  to  [14],  except  that  we  handle 
multiple  classes,  and  therefore  need  to  extract  more  than 


one  direction  w.  Let  these  directions  be  noted  as 
i  =  1..Z.  Note  that  there  is  no  one-to-one  correspondence 
between  the  directions  and  the  classes.  Similar  to  FDA, 
we  constrain  every  two  extracted  directions  by 

w^Tw^  =  y(z)Ty[^/(j)  =  Sij.  To  make  the  form  of 
the  constraint  simpler,  we  transform  our  coordinate  sys¬ 
tem.  Let  A  =  EDEt  be  an  eigenvalue  decomposition 
of  A,  containing  only  eigenvectors  with  non-zero  eigenval¬ 
ues.  Define  =  DXI2ET 7W.  The  matrices  S  and  R 
are  spanned  by  the  column  space  of  the  matrix  A.  There¬ 
fore,  out  of  all  possible  solutions  to  this  linear  equation 
(j)0)  =  D1/2Et yW,  the  one  which  maximizes  our  crite¬ 
ria  is  given  by  yW  =  ED We  therefore 


maximize 


N  D 


E 1  RED 


-!/2, 


£TD-l/2ETSED-l/2(f) 


subject  to  the  directions 


being  orthogonal,  i.e 

Variable  selection  is  done  by  replacing  the  kernel  ma¬ 
trix  A  with  the  matrix  Aa  =  J2i= 1  aimi and  the  ma_ 
trices  A^  with  the  matrices  A&  =  Y17=i  > 

where  is  a  vector  containing  the  values  of  the  ith 
variable,  but  only  for  the  points  that  belong  to  class  C&. 
With  this  replacement,  based  on  the  formulas  for  S  and 
R,  both  the  numerator  and  the  denominator  are  linear  in 
a.  However  the  ratio  is  not.  We  therefore  investigate  an¬ 
other  way  to  maximize  the  numerator  and  minimize  the  de¬ 
nominator.  We  maximize:  (cj)T D~1/2ET RED~X!2 <p)  — 
A (p\)T D~x!2Et SED~x!2(\)).  Note  that  an  optimization 
function  which  is  similar  in  spirit  was  used  by  Jebara  [11] 
in  the  context  of  invariance  learning. 

The  optimization  problem  becomes  bi-linear  with  re¬ 
spect  to  a,  and  variable  selection  algorithms  simi¬ 
lar  to  the  ones  above  can  be  derived.  For  sim¬ 
plicity,  we  just  show  the  analog  of  Algo.  2,  cre¬ 
ated  by  maximizing  trace(((f)T 2 ET RE D-1/ 2 (f))  — 
\((f)T D~1/2ET SED-1/2^))  over  a.  Algo.  3  above  is  sim¬ 
ply  achieved  by  rearranging  the  terms  of  the  maximization 
problem,  such  that  the  bilinear  nature  is  apparent. 
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